Audio-visual speaker conversion using prosody features
Authors
Abstract
This article presents a joint audio-visual approach to speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using experimental data from the 3D BIWI Audiovisual corpus of Affective Communication, mapping functions are built between each pair of speakers in order to convert speaker-specific features: the speech signal and 3D facial expressions. The results obtained by combining audio and visual features are compared with corresponding results from earlier approaches, and the improvements brought by introducing dynamic features and exploiting prosodic features are outlined.
Similar resources
Automatic Building of Synthetic Voices from Audio Books
Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves the development of prosodically rich speech databases. However, developing such speech databases requires a large amount of effort and time. An alternative is to exploit story-style monologues (long speech files) in audio books. These...
Audio-Visual Correlation Modeling for Speaker Identification and Synthesis
This thesis addresses two major problems of multimodal signal processing using audiovisual correlation modeling: speaker recognition and speaker synthesis. We address the first problem, i.e., the audiovisual speaker recognition problem within an open-set identification framework, where audio (speech) and lip texture (intensity) modalities are fused employing a combination of early and late inte...
Automatic speaker recognition as a measurement of voice imitation and conversion
Voices can be deliberately disguised by means of human imitation or voice conversion. The question arises to what extent they can be modified by using either method. In the current paper, a set of speaker identification experiments is conducted; first, analysing some prosodic features extracted from voices of professional impersonators attempting to mimic a target voice and, second, using both...
Hierarchical modeling of F0 contours for voice conversion
Voice conversion systems deal with the conversion of a speech signal so that it sounds as if it were uttered by another speaker. The conversion of the spectral features has attracted a lot of research attention, but the conversion of pitch, modeling the speaker-dependent prosody, is often achieved by just controlling the F0 level and range. However, the detailed prosody, including different linguistic unit...
Prosodic features for speaker verification
In this paper we study the effectiveness of prosodic features for speaker verification. We hypothesize that prosody is linked to linguistic units such as syllables and that prosodic features can be better represented with reference to the syllabic sequence. For extracting prosodic features, speech is segmented into syllable-like regions using the knowledge of vowel onset points (VOP). We use a techni...